projpredSEM

Projection predictive variable selection for Bayesian regularized SEM

Sara van Erp

Utrecht University

Outline

  1. The idea: Projection predictive variable selection
  2. Adaptation to SEM: A prototype
  3. Discussion and future directions

Why is projpredSEM needed?

In models with many parameters, regularization is increasingly used to prevent overfitting.

Regularized SEM adds a penalty to the parameters to regularize, e.g., ridge, lasso, or elastic net

\[ F_{regsem}(S, \Sigma(\theta)) = F(S, \Sigma(\theta)) + \lambda P(\theta_{reg}) \]

See, for example, Jacobucci, Grimm, and McArdle (2016) or Huang (2020).
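To make the penalty term \(\lambda P(\theta_{reg})\) concrete, here is a minimal sketch (in Python rather than R, with made-up parameter values) of the three penalties mentioned above:

```python
# Minimal sketch of common penalty functions P(theta_reg); the parameter
# values below are made up for illustration.
theta_reg = [0.9, -0.4, 0.05, 0.0]

def ridge(theta, lam):
    # lambda * sum of squared parameters
    return lam * sum(t ** 2 for t in theta)

def lasso(theta, lam):
    # lambda * sum of absolute parameters
    return lam * sum(abs(t) for t in theta)

def elastic_net(theta, lam, alpha):
    # convex mixture of the lasso (weight alpha) and ridge (weight 1 - alpha)
    return alpha * lasso(theta, lam) + (1 - alpha) * ridge(theta, lam)

lam = 0.5
print(round(ridge(theta_reg, lam), 5))             # 0.48625
print(round(lasso(theta_reg, lam), 5))             # 0.675
print(round(elastic_net(theta_reg, lam, 0.5), 6))  # 0.580625
```

The lasso penalizes small parameters relatively harder than the ridge, which is why it can shrink them exactly to zero in classical regularized SEM.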

Regularized SEM: An example

See for example Zhang and Liang (2023).

Bayesian regularized SEM

Bayesian regularized SEM is increasingly used.

A shrinkage prior takes the role of the penalty:

\[ p(\theta|y) \propto p(y|\theta)p(\theta) \]

Many different shrinkage priors exist (see, e.g., Van Erp, Oberski, and Mulder 2019).

Ideal shrinkage prior:
1. Peaked around zero
2. Heavy tails
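These two properties can be checked numerically. The sketch below (Python, unit-variance densities) compares a normal prior with a Laplace (double-exponential) prior, which is both more peaked at zero and heavier-tailed:

```python
import math

# Compare a normal and a Laplace (double-exponential) density, both scaled
# to unit variance, to illustrate "peaked around zero" and "heavy tails".
def normal_pdf(x, sd=1.0):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def laplace_pdf(x, b=1.0 / math.sqrt(2)):  # variance = 2 * b^2 = 1
    return math.exp(-abs(x) / b) / (2 * b)

print(laplace_pdf(0.0) > normal_pdf(0.0))  # True: more mass near zero
print(laplace_pdf(4.0) > normal_pdf(4.0))  # True: heavier tails
```

The peak pulls small, probably irrelevant effects strongly toward zero, while the heavy tails leave large, probably relevant effects relatively unshrunken.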

Shrinkage priors

Advantages of Bayesian regularized SEM

  • Intuitive interpretation
  • Automatic uncertainty estimates
  • Incorporation of prior information
  • Flexibility in the choice of shrinkage prior
  • Automatic estimation of the penalty parameter

Disadvantages of Bayesian regularized SEM

  • Computationally expensive
  • Limited availability of user-friendly software

Software overview of (Bayesian) regularized SEM (Van Erp 2023)

Disadvantages of Bayesian regularized SEM

  • Computationally expensive
  • Limited availability of user-friendly software
  • Parameters are not automatically set to zero

Why is projpredSEM needed?

In Bayesian regularized SEM, parameters are not automatically set to zero.

  • Van Erp, Oberski, and Mulder (2019) showed that different conditions require different credible intervals (CIs)
  • Zhang, Pan, and Ip (2021) showed that different conditions require different (arbitrary) types of criteria
  • Selection criteria based on marginal posteriors might perform differently than criteria based on the joint posterior

Marginal vs. joint posteriors

See also Piironen et al. (2018)

Why is projpredSEM needed?

In Bayesian regularized SEM, parameters are not automatically set to zero.

  • Van Erp, Oberski, and Mulder (2019) showed that different conditions require different credible intervals (CIs)
  • Zhang, Pan, and Ip (2021) showed that different conditions require different (arbitrary) types of criteria
  • Marginal criteria might perform differently than joint criteria

Goal of projpredSEM: to provide a formal method for parameter (i.e., model) selection.

The idea: Projection predictive variable selection

Goal: Finding a smaller submodel that predicts practically as well as the larger reference model.

  1. Specify a reference model
  2. Project the posterior information of the reference model onto the candidate models
  3. Select the candidate model with the best predictive performance

Projection predictive variable selection explained

Consider a simple linear regression model: \(y = X\beta + \epsilon\) with 20 predictors.

Importantly: projpred (Piironen et al. 2023) is designed for minimal subset selection.

projpred works automatically with brms or rstanarm:

library(rstanarm)
library(projpred)

# Fit the reference model with a horseshoe shrinkage prior on the coefficients
ref_fit <- stan_glm(
  y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 +
    X11 + X12 + X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20,
  family = gaussian(),
  data = dat,
  prior = hs()
)

# Wrap the fit as a projpred reference model object
ref_obj <- get_refmodel(ref_fit)

Projection predictive variable selection explained

It is important to assess the quality of the reference model, since the predictive performance of the submodel can only be as good as that of the reference model.

When diagnosing a reference model, there are three primary dimensions we recommend the statistician to investigate:
1. posterior sensitivity to the prior and likelihood;
2. posterior predictive checks;
3. cross-validation and the influence of data on the posterior.

McLatchie et al. (2023)

Projection predictive variable selection explained

Search: determine the solution path

Evaluation: determine the final submodel size

Cross-validation is used for the evaluation part and recommended for the search part to protect against overfitting (Piironen and Vehtari 2017).

cvvs <- cv_varsel(
  ref_obj,
  validate_search = TRUE,
  cv_method = "kfold", # default is loo
  K = 2                # number of folds
)

Projection predictive variable selection explained

Adaptation to SEM: A prototype

Adaptation to SEM: A prototype

Based on a MIMIC model with \(k = 1, \ldots, K\) latent variables \(\pmb{\eta}\), \(p = 1, \ldots, P\) predictor variables \(\pmb{x}\), \(j = 1, \ldots, J\) indicator variables \(\pmb{y}\) for \(i = 1, \ldots, N\) measurements: \[ \begin{aligned} \pmb{\eta} & = \pmb{\Gamma} \pmb{x} + \pmb{\delta} \\ \pmb{y} & = \pmb{\Lambda} \pmb{\eta} + \pmb{\epsilon} \end{aligned} \]

with \(\pmb{x} \sim \pmb{N}_P (\pmb{0}, \pmb{\Phi})\), \(\pmb{\delta} \sim \pmb{N}_K(\pmb{0}, \pmb{\Psi})\), and \(\pmb{\epsilon} \sim \pmb{N}_J(\pmb{0}, \pmb{\Theta})\).
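As a rough numerical sketch of the model-implied moments (Python, with hypothetical parameter values; K = 1 latent variable, P = 1 predictor, J = 2 indicators):

```python
# Model-implied moments for a minimal MIMIC model:
#   eta = gamma * x + delta,  y_j = lambda_j * eta + eps_j
# All parameter values below are hypothetical.
gamma = 0.9          # structural coefficient (Gamma)
lam = [0.7, 0.7]     # factor loadings (Lambda)
phi = 1.0            # Var(x)     (Phi)
psi = 1.0            # Var(delta) (Psi)
theta = [0.5, 0.5]   # Var(eps_j) (Theta, diagonal)

var_eta = gamma ** 2 * phi + psi                     # Var(eta)
sigma_yy = [[lam[j] * lam[k] * var_eta + (theta[j] if j == k else 0.0)
             for k in range(2)] for j in range(2)]   # Cov(y)
sigma_xy = [phi * gamma * l for l in lam]            # Cov(x, y_j)

print(round(var_eta, 2))         # 1.81
print(round(sigma_xy[0], 2))     # 0.63
print(round(sigma_yy[0][0], 4))  # 1.3869
```

These implied moments are exactly the quantities needed to build the predictive distribution on the next slide.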

projpredSEM prototype: Predictive distribution

Based on the model-implied covariance matrix, the predictive distribution for new responses \(\pmb{y}\) given values \(\pmb{x}_0\) for the predictors is multivariate normal with mean vector \(\pmb{\mu}_{y|x_0}(\hat{\pmb{\theta}})\) and covariance matrix \(\pmb{\Sigma}_{y|x_0}(\hat{\pmb{\theta}})\), which are given by (De Rooij et al. 2023):

\[ \begin{aligned} \pmb{\hat{\mu}}_{y|x_0} & = \pmb{\hat{\mu}}_y + \pmb{\hat{\Sigma}}_{xy}^T \pmb{\hat{\Sigma}}_{xx}^{-1} (\pmb{x}_0 - \pmb{\hat{\mu}}_x) \\ \pmb{\hat{\Sigma}}_{y|x_0} & = \pmb{\hat{\Sigma}}_{yy} - \pmb{\hat{\Sigma}}_{xy}^T \pmb{\hat{\Sigma}}_{xx}^{-1} \pmb{\hat{\Sigma}}_{xy} \end{aligned} \]
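In the scalar case (J = 1, P = 1) these conditioning formulas reduce to the familiar bivariate-normal results; a small numerical sketch (Python, with hypothetical values for the model-implied moments):

```python
# Conditional (predictive) moments in the scalar case (J = 1, P = 1),
# using hypothetical values for the model-implied moments.
mu_y, mu_x = 0.0, 0.0
sigma_yy, sigma_xx, sigma_xy = 1.3869, 1.0, 0.63

x0 = 1.5  # predictor value for a new observation
mu_y_given_x0 = mu_y + sigma_xy / sigma_xx * (x0 - mu_x)
sigma_y_given_x0 = sigma_yy - sigma_xy ** 2 / sigma_xx  # never exceeds sigma_yy

print(round(mu_y_given_x0, 3))     # 0.945
print(round(sigma_y_given_x0, 4))  # 0.99
```

Conditioning on the predictors shifts the predictive mean toward \(\pmb{x}_0\) and reduces the predictive variance, exactly as the matrix formulas above describe.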

projpredSEM prototype: Objective function

The parameters of the submodel are chosen such that the Kullback-Leibler (KL) divergence between the predictive distribution based on the reference model and the predictive distribution based on the submodel is minimized, i.e.,

\[ \text{min}_{\pmb{\theta}_{sub}} \text{ KL} [\pmb{N}_J(\pmb{\mu}_{y|x_0}(\pmb{\theta}_{ref}), \pmb{\Sigma}_{y|x_0}(\pmb{\theta}_{ref})) \text{ || } \pmb{N}_J(\pmb{\mu}_{y|x_0}(\pmb{\theta}_{sub}), \pmb{\Sigma}_{y|x_0}(\pmb{\theta}_{sub})) ] \] This leads to the objective function:

\[ f = \text{Tr} (\pmb{\Sigma}_p^{-1} \pmb{\Sigma}) + (\pmb{\mu} - \pmb{\mu}_p)^T \pmb{\Sigma}_p^{-1} (\pmb{\mu} - \pmb{\mu}_p) + \text{log} |\pmb{\Sigma}_p | - \text{log} |\pmb{\Sigma}|- J \]
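In the scalar case (J = 1), \(f\) reduces to twice the KL divergence between two univariate normals; a quick numerical check (Python, hypothetical moments):

```python
import math

# Objective f in the scalar case (J = 1): f equals twice the KL divergence
# KL[N(mu, s2) || N(mu_p, s2p)] between reference and projected distribution.
def f_obj(mu, s2, mu_p, s2p, J=1):
    return s2 / s2p + (mu - mu_p) ** 2 / s2p + math.log(s2p) - math.log(s2) - J

def kl_normal(mu, s2, mu_p, s2p):
    return 0.5 * (s2 / s2p + (mu - mu_p) ** 2 / s2p + math.log(s2p / s2) - 1)

mu, s2 = 0.5, 1.2      # hypothetical reference-model moments
mu_p, s2p = 0.3, 1.0   # hypothetical projected submodel moments

print(math.isclose(f_obj(mu, s2, mu_p, s2p), 2 * kl_normal(mu, s2, mu_p, s2p)))  # True
print(f_obj(mu, s2, mu, s2))  # 0.0 when the projection matches the reference
```

The objective is zero exactly when the submodel reproduces the reference predictive distribution, and grows as the projected moments drift away from it.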

Illustration of the prototype (1)

library(lavaan)

nsim <- 100 # sample size (illustrative value)
p_rel <- 2  # number of relevant predictors
p_zero <- 8 # number of irrelevant (zero) predictors
mod <- 'F =~ .7*y1 + .7*y2 + .7*y3 + .7*y4 + .7*y5
        F ~ .9*x1 + .9*x2 + 0*x3 + 0*x4 + 0*x5 + 0*x6 + 0*x7 + 0*x8 + 0*x9 + 0*x10
        x1 ~~ 0.8*x2
        x1 ~~ 0.5*x3
        x2 ~~ 0.5*x3'
dat <- simulateData(mod, sample.nobs = nsim, empirical = TRUE)

Illustration of the prototype (1)

Note: lavaan, lslx and marginal CIs handle this case well too.

Illustration of the prototype (2)

library(lavaan)

ntrain <- 20 # training sample size
ntest <- 15  # test sample size
p_rel <- 2   # number of relevant predictors
p_zero <- 8  # number of irrelevant (zero) predictors
sim_mod <- 'F =~ .7*y1 + .7*y2 + .7*y3 + .7*y4 + .7*y5
            F ~ .9*x1 + .9*x2 + 0*x3 + 0*x4 + 0*x5 + 0*x6 + 0*x7 + 0*x8 + 0*x9 + 0*x10'
dat_train <- simulateData(sim_mod, sample.nobs = ntrain, empirical = TRUE)
dat_test <- simulateData(sim_mod, sample.nobs = ntest, empirical = TRUE)

Illustration of the prototype (2)

Note: lslx fails to converge with various penalties; only the elastic net selects the correct variables, but it estimates their effects at 0.005.

Extensions

  • Speed up the algorithm, e.g., by using the factor scores directly
  • Add clusters of posterior draws: the projection is currently based on posterior means only
  • Incorporate cross-validation (currently 2-fold)
  • Incorporate alternative selection criteria (e.g., ELPD)
  • Add other shrinkage priors
  • Extend to other SEMs
  • Make available in user-friendly software (long-term)

Discussion points

  • Which models could benefit most from this predictive approach?
    • How about cross-loadings/residual correlations?
    • Model selection vs. optimizing predictive power
  • Comparison to: existing criteria; classical regularized SEM methods; spike-and-slab priors; decoupled shrinkage and selection (Hahn and Carvalho 2015); other suggestions?

References

De Rooij, Mark, Julian D. Karch, Marjolein Fokkema, Zsuzsa Bakk, Bunga Citra Pratiwi, and Henk Kelderman. 2023. “SEM-Based Out-of-Sample Predictions.” Structural Equation Modeling: A Multidisciplinary Journal 30 (1): 132–48. https://doi.org/10.1080/10705511.2022.2061494.
Hahn, P. Richard, and Carlos M. Carvalho. 2015. “Decoupling Shrinkage and Selection in Bayesian Linear Models: A Posterior Summary Perspective.” Journal of the American Statistical Association 110 (509): 435–48. https://doi.org/10.1080/01621459.2014.993077.
Huang, Po-Hsien. 2020. “lslx: Semi-Confirmatory Structural Equation Modeling via Penalized Likelihood.” Journal of Statistical Software 93 (7). https://doi.org/10.18637/jss.v093.i07.
Jacobucci, Ross, Kevin J. Grimm, and John J. McArdle. 2016. “Regularized Structural Equation Modeling.” Structural Equation Modeling: A Multidisciplinary Journal 23 (4): 555–66. https://doi.org/10.1080/10705511.2016.1154793.
McLatchie, Yann, Sölvi Rögnvaldsson, Frank Weber, and Aki Vehtari. 2023. “Robust and Efficient Projection Predictive Inference.” arXiv. http://arxiv.org/abs/2306.15581.
Piironen, Juho, Michael Betancourt, Daniel Simpson, and Aki Vehtari. 2018. “Discussion to ‘Uncertainty Quantification for the Horseshoe’ by Stéphanie van der Pas, Botond Szabó, and Aad van der Vaart.”
Piironen, Juho, Markus Paasiniemi, Alejandro Catalina, Frank Weber, and Aki Vehtari. 2023. “projpred: Projection Predictive Feature Selection.” https://mc-stan.org/projpred/.
Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing 27 (3): 711–35. https://doi.org/10.1007/s11222-016-9649-y.
Van Erp, Sara. 2023. “Bayesian Regularized SEM: Current Capabilities and Constraints.” Psych 5 (3): 814–35. https://doi.org/10.3390/psych5030054.
Van Erp, Sara, Daniel L. Oberski, and Joris Mulder. 2019. “Shrinkage Priors for Bayesian Penalized Regression.” Journal of Mathematical Psychology 89 (April): 31–50. https://doi.org/10.1016/j.jmp.2018.12.004.
Zhang, Lijin, and Xinya Liang. 2023. “Bayesian Regularization in Multiple-Indicators Multiple-Causes Models.” Psychological Methods.
Zhang, Lijin, Junhao Pan, and Edward Haksing Ip. 2021. “Criteria for Parameter Identification in Bayesian Lasso Methods for Covariance Analysis: Comparing Rules for Thresholding, p-Value, and Credible Interval.” Structural Equation Modeling: A Multidisciplinary Journal 28 (6): 941–50. https://doi.org/10.1080/10705511.2021.1945456.